SYSTEMS WITH LARGE FLEXIBLE SERVER POOLS: 
INSTABILITY OF "NATURAL" LOAD BALANCING 

ALEXANDER L. STOLYAR AND ELENA YUDOVINA 

Abstract. We consider general large-scale service systems with multiple customer classes 
and multiple server (agent) pools; mean service times depend both on the customer class and 
server pool. It is assumed that the allowed activities (routing choices) form a tree (in the 
graph with vertices being both customer classes and server pools). We study the behavior of 
the system under a natural (load balancing) routing/scheduling rule, Longest-queue freest- 
server (LQFS-LB), in the many-server asymptotic regime, such that the exogenous arrival 
rates of the customer classes, as well as the number of agents in each pool, grow to infinity 
in proportion to some scaling parameter r. Equilibrium point of the system under LQBS-LB 
is the desired operating point, with server pool loads minimized and perfectly balanced. 

Our main results are as follows, (a) We show that, quite surprisingly (given the tree 
assumption) , for certain parameter ranges, the fluid limit of the system may be unstable in 
the vicinity of the equilibrium point; such instability may occur if the activity graph is not 
"too small", (b) Using (a), we demonstrate that the sequence of stationary distributions of 
diffusion- scaled processes (measuring 0{y/r) deviations from the equilibrium point) may be 
non-tight, and in fact may escape to infinity, (c) In one special case of interest, however, we 
show that the sequence of stationary distributions of diffusion-scaled processes is tight, and 
the limit of stationary distributions is the stationary distribution of the limiting diffusion 
process. 



1. Introduction 

Large-scale service systems (such as call centers) with heterogeneous customer and server 
(agent) populations bring up the need for efficient dynamic control policies that match 
arriving (or waiting) customers and available servers. In this setting, two goals are desirable. 
On the one hand, customers should not be kept waiting if this is possible. On the other 
hand, idle time should be distributed fairly among the servers. For example, one would like 
to avoid the situation in which one of the server pools is fully busy while another one has 
significant numbers of idle agents. 

Consider a general system, where the arrival rate of class i customers is Aj, the service 
rate of a class i customer by type j agent is iiij, and the server pool sizes are Bj. Another 
very desirable feature of a dynamic control is insensitivity to parameters Aj and /lij. That is, 
the assignment of cTistomers to server pools should, to the maximal degree possible, depend 
only on the current system state, and not on prior knowledge of arrival rates or mean service 
times, because those parameters may not be known in advance and, moreover, they may be 
changing in time. 

If the system objective is to minimize the maximum average load of any server pool, a 
"static" optimal control can be obtained by solving a linear program, called static planning 
problem (SPP), which has B/s, iii/s and Aj's as parameters. An optimal solution to the 
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SPP will prescribe optimal average rates Aj^ at which arriving customers should be routed 
to the server pools. Typically (in a certain sense) the solution to SPP is unique and the 
basic activities, i.e. routing choices (ij) for which Aij > 0, form a tree; let us assume this is 
the case. It is possible to design a dynamic control policy, which achieves the load balancing 
objective without a priori knowledge of input rates Aj - the Shadow Routing pohcy in [12] 
does just that, and in the process it "automatically identifies" the basic activity tree. Shadow 
Routing policy, however, does need to "know" the service rates fj,ij. 

The key question we address in this paper is as follows. Suppose, a control policy does not 
know the service rates /ij^, but "somehow" it does know the structure of the basic activity 
tree, and restrict routing to this tree only. (For example, all feasible activities, i.e. those 
{ij)^s for which fiij > 0, may form a tree simply by the structure of the system. Another 
example: if Shadow Routing has some estimates of fiij, this will not be sufficient for it to 
identify the optimal routing rates, but may very well be sufficient to correctly identify the 
basic activity tree.) What is an efficient load balancing policy in this case? 

If routing is restricted to a tree, it is very natural to conjecture that simple policies of the 
type considered by Gurvich and Whitt [7], Atar-Shaki-Shwartz p], Armony and Ward [T], 
which are of the "serve longest queue" and "join least loaded pool" type, should "typically 
be good enough". Some of the results in these (and other) papers, in fact, prove optimal 
behavior of simple load balancing schemes on a finite time interval; which further supports 
the above informal conjecture. One of the main contributions of our work is to show that, 
surprisingly, the above conjecture is not correct for a general parameter setting. The key 
reason is that a "natural" load balancing, even if it is done along an a priori given optimal 
tree, may render the system unstable in the vicinity of equilibrium point. 

The specific control rule we analyze in this paper can be seen as a special case of the 
Queue-and-Idleness- Ratio rule considered in [7|. Within the given (basic) activity tree, if 
an arriving customer sees multiple available servers, it will choose the server pool with the 
smallest load; while if a server sees several customers waiting in queues, it will take a customer 
from the longest queue. We call this rule Longest-queue freest-server (LQFS-LB). 

We consider a many-server asymptotic regime, such that Aj = A^r (or sometimes Aj = 
Ajr + 0{y/r)), Bj = (3jr, where Aj and /3j are some positive constants, r — cxd is a scaling 
parameter, and Hij remain constant. Our key results show that the fluid limit of the system 
process (obtained via space-scaling by 1/r) can be unstable in the vicinity of the equilibrium 
point. This is very counterintuitive, because it would be reasonable to expect the contrary: 
that a simple load balancing in a system with activity graph free of cycles would be "well 
behaved" . 

Using the fluid limit local instability (when such occurs), we prove that the sequence of 
stationary distributions of diffusion- scaled processes (measuring 0{y/r) deviations from the 
equilibrium point) may be non-tight, and in fact may escape to inflnity. This of course means, 
in particular, that the behavior of the diffusion limit in the vicinity of equilibrium point on 
a finite time interval, may not be relevant to the system behavior in steady state, because 
the system "does not spend any time" in the ©(a/t) -vicinity of the equilibrium point. 

In addition to the instability examples, we prove that in several cases the fluid limit will 
be (at least locally) stable. We demonstrate that fluid limit of any underloaded system with 
at most two customer classes, or critically loaded system with at most four customer classes, 
is always locally stable. We also demonstrate local stability in the case when the service rate 
depends only on the customer type (but not server pool, as long as it can serve it). In the 
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case when the service rate depends only on the server type (but not customer type, as long 
as it can be served), we show more - the global stability of the fluid limit. 

General results on the asymptotics of stationary distributions (most importantly - their 
tightness), especially in the many-server systems' diffusion limit, are notoriously difficult to 
derive. (For recent results in this direction see O |6].) In the special case when the service 
rate depends only on the server type, we prove that under the LQFS-LB policy the sequence 
of stationary distributions of diffusion-scaled processes is tight, and the limit of stationary 
distributions is the stationary distribution of the limiting diffusion process. 

The structure of the paper is as follows. In Section |2] we present the model, define the static 
planning problem and related notions, and define the LQFS-LB policy. In Section [3] we define 
fluid models of the system, derive their basic properties in the vicinity of an equilibrium point, 
and define local stability. Section |4] contains fluid model stability results in the two special 
cases when the service rate depends on server class only or on customer type only. Our key 
results on local instability of fluid models are presented in Section |5| In Section |6] we consider 
an underloaded system (with optimal average utilization being 1 — e < 1), and prove possible 
evanescence of stationary distributions of the diffusion scaled processes. Finally, Section [7] 
considers the so caled Halfin-Whitt asymptotic regime (where the optimal average utilization 
is 1 — 0(l/\/r)), and contains two main results on the asymptotics of stationary distributions 
of the diffusion scaled processes: (a) possible evanescence under certain parameters and (b) 
tightness (and "limit interchange" ) result for the case when the service rate depends only on 
the server type. 



2.1. The model; Static Planning (LP) Problem. Consider the model in which there 
are / customer classes, or types, labeled 1,2, . . . ,/, and J server (agent) pools, or classes, 
labeled 1,2, . . . , J (generally, we will use the subscripts i' for customer classes, and j, f 
for server pools). The sets of customer classes and server classes will be denoted by X and 
J respectively. 

We are interested in the scaling properties of the system as it grows large. The meaning of 
"grows large" is as follows. We consider a sequence of systems indexed by a scaling parameter 
r. As r grows, the arrival rates and the sizes of the service pools, but not the speed of service, 
increase. Specifically, in the rth system, customers of type i enter the system as a Poisson 
process of rate A[ = rAj -|- o(r), while the jth server pool has r/3j individual servers. (All Aj 
and /3j are positive parameters.) Customers may be accepted for service immediately upon 
arrival, or enter a queue; there is a separate queue for each customer type. Customers do not 
abandon the system. When a customer of type i is accepted for service by a server in pool 
j, the service time is exponential of rate Hij] the service rate depends both on the customer 
type and the server type, but not on the scaling parameter r. If customers of type i cannot 
be served by servers of class j, the service rate is /Xjj = 0. 

We would like to balance the proportion of busy servers across the server pools, while 
keeping the system operating efficiently. Let A^^- be the average rates at which type i cus- 
tomers are routed to server pools j. We would like the system state to be such that A[j- 
are close to Aj^r, where {Xij} is an optimal solution to the following static planning problem 
(SPP), which is the following linear program: 
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subject to 

(2) A,, >0, V^,j 

(3) ^ Xij = Xi, \/i 



(4) 5^A.,/(/3,/x.,)<p, Vj. 

i 

We assume that the SPP has a unique optimal solution {Xij, i E X, j ^ J}iP] and it 
is such that the basic activities, i.e. those pairs, or edges, {ij) for which Xij > 0, form a 
(connected) tree in the graph with vertices set XU JT". The set of basic activities is denoted £. 
These assumptions constitute the complete resource pooling (CRP) condition, which holds 
"generically" ; see [121 Theorem 2.2]. For a customer type i, let S{i) = {j : (ij) G £}; for a 
server type j, let C(j) = {i : (ij) G £}. 

Note that under the CRP condition, all ( "server pool capacity" ) constraints Q are bind- 
ing; in other words, the optimal solution to SPP minimizes and "perfectly balances" server 
pool loads. Optimal dual variables z/j, z G X, and aj, j G JT", corresponding to constraints ^ 
and respectively, are unique and all strictly positive; Ui is interpreted as the "workload" 
associated with one type i customer, and aj is interpreted as the (scaled by 1/r) maximum 
rate at which server pool j can process workload. The following relations hold: 

aj = max 1^1(3 jfiij z/j = mmaj/{f3j^ij) 

i j 

XI = ^ ^^^^ = pY(^j = p- 

j i 

If p < 1, the system is called underloaded; if p = 1, the system is called critically loaded. 
In this paper we consider both cases. 

In this paper, we assume that the basic activity tree is known in advance, and restrict our 
attention to the basic activities only. Namely, we assume that a type i customer service in 
pool j is allowed only if {ij) G £. (Equivalently, we can a priori assume that £ is the set of 
all possible activities, i.e. p^j = when {ij) ^ £, and £ is a tree. In this case CRP requires 
that all feasible activities are basic.) 

Let tplj = Xij/nij. Continuing our interpretation of the optimal operating point of the 
system, let {t) be the number of servers of type j serving customers of type i at time t. It 
is desirable to have ^lj{t) = Tip*- + o(r). Later on we will be also interested in the question 
of whether or not the o(r) term can in fact be 0{^Jr). 

2.2. Longest-queue, freest-server load balancing algorithm (LQFS-LB). For the 

rest of the paper, we analyze the performance of the following intuitive load balancing 
algorithm. 

We introduce the following notation (for the system with scaling parameter r): 
\E'^j(t): the number of servers of type j serving customers of type i at time t; 
\E'j(t) = Ylii ^ij(^)- the total number of busy servers of type j at time t; 
^r(^) — Tlij ^\j{^)'- the total number of servers serving type i customers at time t; 
S^(t) = ^'j(t)//3j: the instantaneous load of server pool j at time t; 
Q[(t): the number of customers of type i waiting for service at time t; 
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XI' (t) = \E'[(t) + Qiit): the total number of customers of type i in the system at time 
t. 

The algorithm consists of two parts: routing and scheduling. "Routing" determines where 
an arriving customer goes if it sees available servers of several different types. "Scheduling" 
determines which waiting customer a server picks if it sees customers of several different 
types waiting in queue. 

Routing: If an arriving customer of type i sees any unoccupied servers in server classes 
in S{i), it will pick a server in the least loaded server pool, i.e. j G argminjg5(j) (Ties 
are broken in an arbitrary Markovian manner.) 

Scheduling: If a server of type j, upon completing a service, sees a customer of a class 
in C(j) in queue, it will pick the customer from the longest queue, i.e. i G argmaxjgco) Ql- 
(Ties are broken in an arbitrary Markovian manner.) 

By Remark 2.3], the LQFS-LB algorithm described here is a special case of the algo- 
rithm proposed by Gurvich and Whitt, with constant probabilities Pi = j (queues "should" 

be equal), Vj = (the proportion of idle servers "should" be the same in all server pools). 

2.3. Basic notation. Vector G X), where C, can be any symbol, is often written as 

(6) or ^x; similarly, E J) = {^j) = ij and {iij, (ij) E £) = = ^e- We will 

treat {^ij) = ^£ as a. vector, even though its elements have two indices. Unless specified 
otherwise, = Zliec(j) ^^"^ = ^jes{i) ^v- functions (or random processes) 

{^{t), t > 0) we often write ^(■). (And similarly for functions with domain different from 
[0, oo).) So, for example, (^i(-)) and ^x(-) both signify {{C,i(t),i G X), t > 0). The indicator 
function of a set A is denoted 1a', that is, 1a('^) = I ii uj E A and otherwise. 

The symbol =^ denotes convergence in distribution of either random variables in the 
Euclidean space M*^ (with appropriate dimension d), or random processes in the Skorohod 
space D'^[ri,oo) of ROLL (right-continuous with left limits) functions on [77,00), for some 
constant i] > 0. (Unless explicitly specified otherwise, i] = 0.) The symbol denotes the 
weak convergence of probability measures on M*^, or its one-point compactification M.'^ = 
M"^ U {*}, where * is the "point at infinity". We always consider the Borel a-algebras on M*^ 
and W. 

Standard Euclidean norm of a vector x G M'^ is denoted The symbol — )■ denotes 
ordinary convergence in M'^ or M'^. Abbreviation u.o.c. means uniform on compact sets 
convergence of functions, with the argument (usually in [0, 00)) which is clear from the 
context; w.p.l means convergence with probability 1; f{t) means {d / dt) f (t) . Transposition 
of a matrix H is denoted W; in matrix expressions vectors are understood as column- vectors. 



3. Fluid model 

3.1. Definition. We now consider the behavior of fluid models associated with this system. 
A fluid model is a set of trajectories that w.p.l contains any limit of fluid-scaled trajectories 
of the original stochastic system. (We postpone proving this relationship between the fluid 



models and fluid limits until Section 3.4, in order to not interrupt the main content of 



Section |3| for now, we just formally define fluid models. 

The term fluid model denotes a set of Lipschitz continuous functions 

{(«,(■)), (x.(-)),(g.(-)),(V^..(-)),(p.(-))}, 
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which satisfy the equations below. (Here aj(-) = (aj(t), t > 0), and similarly for other 
components.) The last two equations involving derivatives are to be satisfied at all regular 
points t, when the derivatives in question exist. The interpretation of the components is as 
follows: ai{t) is the total number (actually, "amount", i.e. the number, scaled by 1/r) of 
arrivals of type i customers into the system by time t, Xi{t) is the number ("amount") of 
customers of type i in the system at time t, qi{t) is the number ("amount") of customers of 
type i waiting in queue at time t, ^Jijif) is the number ("amount") of customers of type i 
being served by servers of type j at time t, and pj(t) is the instantaneous load (proportion 
of busy servers, the limit of 'EJ'-{t)/r) in server pool j. 

(5a) ttiit) = \it, Vi G X 

(5b) Xi{t) =qi{t) + J2 ^ij {t),^i^I 

j 

(5c) Xi{t) = Xi{0) + aj(t) - / fiiji!ij{s)ds, 

, Jo 
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(5d) p^.(t) = ^^^^^.(t), VjG J 

(5e) pjit) = 1 if qiit) > for any i G C(j), V j G J 

For any set of server types J'* C J' and any set of customer types X* C X such 
that qi(t) > for all i e 1*, and qi(t) > qi>{t) whenever i e 1*, i' ^ I* and 

s{i) n s{t') n ^ 0, 

For any sets of customer types X* C X, and any set of server types J* ^ J such 
that pj{f) < 1 for all j G JT"*, and pj(t) < Pj'(t) whenever j G JT"*, j' ^ JT*, and 

C(j)nc(/)nx, ^0, 

(5fb) Y Y ^'*^(^)= Y ^^-Y Y /"^^-^^^W 



The meaning of (5fa) is as follows. Consider a set of server types If a set of customer 
types X* consists of the "longest queues for J'*" (we will make this more precise), then 
servers in pools j* G JT*, whenever they finish serving some customer, will immediately 
replace her with someone from a queue in X*. In this case, the total number of customers 
of types X* in service by servers of types Jj* will be increasing at the total rate of servicing 
all customers by servers in J'*, less the rate of servicing customers of types X* by servers in 
J'*. The requirements that X* needs to satisfy for this to be the case are, that there be no 
customer types outside X* with longer queues that servers in J'* can serve. For example, a 
one-element set X* = {i*} is a valid choice for a one-element set J'* = {j*} if and only if the 
customer type i* G C{j*) has the (strictly) longest queue among all of the customer types 
that can be served by j*. 
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The second equation, (5fb), describes the fact that if a set of server pools consists of 
the "least loaded server pools available to Z,,", then servers in pools j* G JT"*, whenever they 
finish serving some customer, will immediately replace her with someone from queue i*. For 
example, a one-element set = {j^,} is a vahd choice for a one-element set X^: = {i^} if and 
only if the server pool j^, G iS(i*) has the (strictly) smallest load pj, among all of the server 
pools that can serve i*. 

3.2. Behavior in the vicinity of equilibrium point. We define the equilibrium (invari- 
ant) point of the underloaded (p < 1) fluid model to be the state ipij = i'ij and qi = q = 
for all i & j & J . (All other components of the fluid model are also constant and uniquely 
defined by (V^*^) and q.) Clearly, iJijit) = ifj^j and qi{t) = g is indeed a stationary fluid model. 
Desirable system behavior would be to have (ipijit)) — )■ {'ipij) as t — )• oo. 

Note that if the initial system state is in the vicinity of the equilibrium point (with p < 1), 
then there is no queueing in the system, and we can describe the system with just the 
variables (ipijlt)). This will be true for at least some time (depending on p and the initial 
distance to the equilibrium point), because the fluid model is Lipschitz. 

The following is a "state space collapse" result for the underloaded fluid model in the 
neighborhood of the equilibrium point. 

Theorem 3.1. Let p < 1. There exists a sufficiently small e > 0, depending only on the 
system parameters, such that for all sufficiently small 5 the following holds. There exist 
Ti = Ti{6) and T2 = T2{6), < Ti < T2, such that if the initial system state {ipij{0)) 
satisfies 

|(^,,(0))-(^^)| 
then for all t G [Ti, T2] the system state satisfies 

\i^pijit)) - < e, pj{t) = py{t) for all],]' G J. 

Moreover, Ti J, and T2 t 00 as 5 The evolution of the system on [Ti,T2] is described 



by a linear ODE, specified below by (10). 



If the fluid system is critically loaded (p = 1), it may have queues at equilibrium, and 
the equilibrium is non-unique. Namely, the definition of an equilibrium (invariant) point for 
p = 1 is the same as for the underloaded system, except the condition on the queues becomes 



qi = q for some constant q > 0. In the next Theorem 3.2 we will only consider the case of 



positive queues {q > 0) for the critically loaded fluid model. 

Theorem 3.2. Let p = 1, and consider an equilibrium point with q > 0. There exists a 
sufficiently small e > 0, depending only on the system parameters, such that for all sufficiently 
small 6 > the following holds. There exist Ti = Ti{6) and T2 = T2{6), < Ti < T2, such 
that if the initial system state satisfies 

|(^i,(0)) - (^*,.)| < ki(0) -q\<S for all i G X, 
then for all t G [Ti, T2] the system state satisfies 

\{.i'ijit)) - < e, |g,(t) -q\ <e for all i G X, 

qi(t) = qi'it) for all i, i' G X. 
Moreover, Ti I and T2 t 00 as 6 10. The evolution of the system on [Ti,T2] is described 



by a linear ODE specified below by (11). 
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In the rest of this section and the paper, the values associated with a stationary fluid 
model, "sitting" at an equilibrium point, are referred to as nominal. For example, is the 
nominal occupancy (of pool j by type i), Aj is the nominal arrival rate, Xij is the nominal 
routing rate (along activity (ij)), 4'*jfJ'ij = \j is the nominal service rate (of type i in pool j), 
i^ijf^ij = is the nominal total service rate (of type «), p is the nominal total occupancy 
(of each pool j), etc. 



Proof of Theorem 3A. Let us choose a suitably small e > (we will specify how small later). 
Because the fluid model trajectories are continuous, we can always choose some T2 > such 
that, for all sufficiently small 5 > 0, if \{ipij{0)) — < S then \ {ipij{t)) — < e for all 

t < T2. We will show that pj(t) = Pj'(t) for all G J', in [Ti,T2] for some Ti depending 
on 6. 

Consider p=„(t) = minjPj(t), p*(t) = maxj pj(t), and assume p*(t) < p*it). Let J'^{t) = 
{j : pj(t) = p^{t)}. As long as p*(t) < p*(t), J^{t) is of course a strict subset of J. The 
total arrival rate to servers of type j G J^:{t) is X^ieu {t)C(j)'^'''-- assumption of 

the connectedness of the basic activity tree, this is strictly greater (by a constant) than the 
nominal arrival rate Yli^^ca) j<^j,(t)^ir "^^^ total rate of departures from those servers is 
'I2i(^c{j) j(^j*{t) f^ij'^iji^)- -^^^ small e, the assumption |(^/'ij(t)) — ("^j*,)! < e implies that this 
is close to the nominal departure rate, so the arrival rate exceeds the service rate by at 
least a constant. (This determines what "suitably small" means for e in terms of the system 
parameters.) Consequently, as long as p*(t) < p*(t), the minimal load p*(t) is increasing at a 
rate bounded below by a constant. Similarly, as long as p*(t) < p*(t), the maximal load p*(t) 
is decreasing at a rate bounded below by a constant. Therefore, the difference p*(t) — p*it) 
is decreasing at a rate bounded below by a constant whenever it is positive. Thus, in finite 
time Ti = Ti{6) we will arrive at a state p*(t) = p*(t). (Clearly, Ti{6) — )■ as 5 — )■ 0.) 
Since the function p*(-) — p*(-) is Lipschitz (hence absolutely continuous), bounded below 
by 0, and (for t < T2) has nonpositive derivative whenever it is differentiable, the condition 
P*it) = P*(t) "will continue to hold for Ti < t < T2. 

It remains to derive the differential equation, and to show that T2 can be chosen depending 
on S so that T2 t 00 as 5 J, 0. 

Once we are confined to the manifold Pj{t) = Pj'{t) = p{t) for all t, the system evolution 
is determined in terms of only I independent variables. Decreasing e if necessary to ensure 
that there is no queueing while \{ipij{t)) — {ip*j)\ < e, we can take the I variables to be 
i'iit) '■= Given {ipi{t)) we know p{t) as Pj)- Consequently, we 

know = P{^)Pj Ylij i^ijit) = i^iit)- On a tree, this allows us to solve for ipijit); 

the relationship will clearly be linear, i.e. 

(7) {^,,{t)) = M{Ut)) 

for some matrix M. For future reference, we define the ("load balancing") linear mapping 
M from y & W to z = {zij, (ij) G £^) G M^"*"-^"-^ as follows: z = My is the unique solution of 



The evolution of ipi{t) is given by 
(9) ipi{t) = \i -^pij'4)ij{t), Mi 
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(Follows from (5c) and the fact that qi{t) = 0.) Then, by the above arguments we see that 
this entails (in matrix form) 

(10) (Mt)) = iX^)+A^{^,{t)), 

where is an J x J matrix, = GM. Here, G is a / x (/ + J — 1) matrix with entries 
Gi,{kj) = —fJ'ij if i = k, and Gi^^kj) = otherwise. 

It remains to justify the claim that T2{6) t c>o as 5 J, 0. This follows from the fact that, 
as long as t > Ti and |('?/'ij(t)) — < e, the evolution of the system is described by the 

linear ODE above. The solutions have the general form 

Mt) -rx = eMAuit - T^))iMTi) - ri), Mt) -re = MiMt) ' V^x) 

where M and Au are constant matrices depending on the system parameters. Therefore, 
if \ipx(Ti) — ipx\ < is sufficiently small, then the time it takes for ipeit) to escape the set 
li^eif) — '^*s\ < e can be made arbitrarily large. Since as 5 J, we have Ti{5) J, 0, and the 
system trajectory is Lipschitz, taking \ips{'^) ""^fl < 5 for small enough 5 will guarantee 
that \ipx{Ti) — ipxl is small, and hence we can choose T2{6) t oo. □ 



The proof of Theorem 3.2 proceeds similarly; we outline only the differences. 



Proof of Theorem 3^. First, since we assume that e > is sufficiently small and \qi{t) — q\ < 
e, i El, for all t < T2, we clearly have pj(t) = 1, j E J', for all t < T2. The equality of queue 
lengths in [Ti,T2] is shown analogously to the proof of p*(t) = p*(t) for in the underloaded 
case. Namely, the smallest queue must increase and the largest queue must decrease (as long 
as not all qi(t) are equal), because it is getting less (resp. more) service than nominal (we 
choose e small enough for this to be true provided \{i'ij{t)) — ii^ij)] < e)- Thus, in [Ti,T2] 
we will have qi(t) = qi'it) for all z, i' G X. 

The linear equation is modified as follows. We have 

j 

where Xi(t) = ^Jiit) + qiit). Since we know that all qi{t) are equal and positive, we have 
= <l{i) = jiJl^kit) - and therefore 



1pi{t) = Xiit) - y ^Xfc(t). 



k 

The rest of the arguments proceed as above to give 

(11) {Ut)) = {x.-\Y.^^)+Acmt)) 

i 

for the appropriate matrix A^ which can be computed explicitly from the basic activity tree. 
(Of course, in [Ti,T2], the trajectory {ipij^-)) uniquely determines (?/'j(-)), (xj(-)) and (gj(-)).) 

Just as above, the existence of the linear ODE, together with the fact that Ti J, as 5 J, 0, 
implies that T2 t cxd as 5 J, 0. □ 

To compute the matrix M, and therefore the matrices Au and Ac, we will find the following 
observation useful. If (ipijit)) = M{ipi(t)), then the common value p(t) = pj{t),\/j, is 
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This allows us to find the values (ipijlt)) from (ipilt)) as follows: if i is a customer-type leaf, 
then ipij{t) = ipi{t)] if j is a server type leaf, then ipij{t) = p{t)Pj] we now remove the leaf 
and continue with the smaller tree. Inductively, for an activity iojo we find 
(12) 

«d(«ojo) id(«ojo) \«d{«o,io) id{io,«o) «d{io,«o) id{«o Jo) 

Here, the relation ^ is defined as follows. Suppose we disconnect the basic activity tree by 
removing the edge (io; Jo)- Then for any node k (either customer type or server type) we say 
k ^ (^0, jo) if it falls in the same component as io; otherwise, k ^ (jo^^o)- 
For example, consider the network in Figure [ij For it, we obtain 




A) Cb 



Figure 1. Example for calculation of the matrix M. 
Since in underload we have 

j 

we obtain an expression for Ay_, given in Lemma [3.3[ i) just below. 

Lemma 3.3. (i) The entries {Au)ii' of the matrix Au (for the underload case, p < 1) are as 
follows. The coefficient of ipi in ipi is 

{Au)ii = E E 

The coefficient of ipii in ipi is 

E ^^J' E ^J' + ^^J"' E ^J' 

— {Au)ii + pij-.,- 

Here, jui G S{i) is the neighbor of i such that, after removing the edge {i,jii') from the basic 
activity tree, nodes i and i' will be in different connected components. (Such a node is unique, 
since there is a unique path along the tree from i to i' .) 
(a) The matrix A^ is non-singular. 

(Hi) The matrix Au depends only on (Pj), (pij) and the basic activity tree structure S, and 
does not depend on (Aj) and (ipij). 
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Proof, (i) In the proof of Theorem 3.1 we showed Au = GM, where Gisa/x (J + J — 1) 
matrix with entries Gi^(^kj) = —SikfJ^ij, where 6ik is the Kronecker's deha function, and M is 
the / X (/ + J — 1) load-balancing matrix whose entries are determined from the expression 



(12). The form of the entries for Au now follows. The equality between the two expressions 
for the off-diagonal entries is a consequence of the fact that, for all j', exactly one of j' ^ (ij), 
j' ^ (jO holds. 

(ii) In the case p < 1, in the vicinity of the equilibrium point, the derivative (tpi) = (Aj) + 
Aui'ipi) (which can be any real- valued /-dimensional vector, within a small neighborhood of 
the origin) uniquely determines (ipij), and then (ipi) as well. Indeed, we have the system of 
J + J linear equations Aj — J2j f^iji^ij = i'i^ and ipij = p(3j, Vj, for the J + J variables 
p, (ipij). This system has unique solution, because p is uniquely determined by the workload 
derivative condition 

i i j 

and then the values of ipij are determined by sequentially "eliminating" leaves of the basic 
activity tree. 

(iii) Follows from (i). □ 
The corresponding expression for A^. is less elegant: 

Lemma 3.4. (i) The entries (Ac)jj' of the matrix Ac (for the critical load case, p = 1) are 
as follows: 



(13) {Ac)ii' = iAu)ii' - J 



ki' 



(ii) The matrix A,, has rank I — 1. The (/ — 1)- dimensional suhspace L = {y \ YliVi ~ 0} 
is invariant under the transformation A^, i.e. A^L C L. Letting vr denote the matrix of the 
orthogonal projection (along (1, . . . , 1)'^ ) onto L, we have Ac = tiA^. Restricted to L, the 
transformation Ac is invertible. 

(iii) The linear transformation Ac, restricted to suhspace L, depends only on {pij) and the 
basic activity tree structure S, and does not depend on [Pj), (Aj) and 

Proof, (i) The fluid model here is such that there are always non-zero queues, which are 
equal across customer types. We can write 

(14) i)i{t) = Xi{t) - j'^Xkit) = (Ai -^Pijipij{t)) - J ^(Afc -^Pkji^kjit)), 



which implies (13). 



(ii) First of all, it is not surprising that Ac does not have full rank: the linear ODE defining 
Ac is such that J2i V'i(^) = at all times, so there are at most (J — 1) degrees of freedom 



in the system. Also, it will be readily seen that (13) asserts precisely that Ac = nAu- Since 
Au is invertible and vr has rank / — 1, their composition has rank I — 1. Since the image of 
Ac is contained in L, the image of Ac (as a map from M^) must be equal to all of L. 

It remains to check that Ac restricted to L still has rank / — I. To see this, we observe that 
the simple eigenvalue of Ac has as its unique right eigenvector the vector A~^{1, 1, . . . , 1)'''. 
We will be done once we show that this eigenvector does not belong to L. Suppose instead 
that AuV = (1, 1, . . . , 1)''' for some v E L, Vi = 0. Then, for a small e > 0, the state ifj^ — ev 
(with balanced pool loads, all equal to the optimal p) would be such that the derivatives of 
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all components ^pi would be strictly negative. This is, however, impossible, because the total 
rate at which the system workload is served, must be zero: 

^ J2 ^^^^ = Yl " Yl = °- 



(iii) The specific expression (13) for Ac may depend on the pool sizes (/3j). However, A^. 
is a singular I x I matrix, and our claim is only about the transformation of the (/ — 1)- 
dimensional subspace L that A^ induces; this transformation does not depend on (Pj), as 
the following argument shows. 

Pick any (ij) G £. Modify the original system by replacing f3j by (3j+6 and Aj by Xi + 6^ij; 
this means that the nominal ipij is replaced by ipij+S. Then, using notation 7j(t) = ipiit) 
the linear ODE 

(15) = Ml^it)), 



which we obtain from the ODE (14) for the original and modified systems, has exactly the 
same matrix A, which implies A = A^. Thus, the transformation A^ must not depend on/?,. 

An alternative argument is purely analytic. Recall that to compute we used (12). 

In critical load, we have p{t) = 1, so the (left) equation ( |T2) ) for ^i^jgit) simplifies to 

(16) ^iodt)= ^^W- Y f^r 

«d(«o,io) id{io,io) 



If we substitute this in the right-hand side of (14), we will obtain a different expression for 



ipiif)- While its constant term will depend on /3j', the linear term will not, since the linear 



term of (16) does not depend on (3j. That is, we have found a way of writing down a matrix 



for Ac which clearly does not depend on the (3j. □ 

3.3. Definition of local stability. We say that the (fluid) system is locally stable, if all 
fluid models starting in a sufliciently small neighborhood of an equilibrium point (which is 
unique for p < 1; and for p = 1 we consider any equilibrium point with equal queues g > 0) 
are such that, for flxed constant C > 0, 

where Aq = \{ipij{Q)) — {ip*j)\ + |(Q'j(0)) — (g, . . . ,g)^|. Note that in the case p = 1 it is not 
required that qi{t) — t- g, for q associated with the chosen equilibrium point. However, local 
stability will guarantee convergence of queues qi{t) — )■ g, with some g > possibly different 
from g. Indeed, the exponentially fast convergence ipsit) — ^ V"! of the occupancies to the 
nominal, guarantees that for some flxed constant Ci > 0, any i and any s > t > 0: 

\xi{s) - x,{t)\ < [ \Xi -Y^l^ijiJijiOld^ < CiAoe-^*. 
Therefore, each Xi{t), and then each qi{t), also converges exponentially fast. Then we can 



apply Theorem 3.2 to show that all qi{t) must be equal starting some time point; therefore 
they converge to the same value g, which is such that that |g — g| < CqAq for some constant 
Co > depending only on the system parameters. In other words, local stability guarantees 
convergence to an equilibrium point not too far from the "original" one. (We omit further 
detail, which are rather straightforward.) 
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By Theorems 3.1 and 3.2 we see that the local stability is determined by the stability of 
a linear ODE, which in turn is governed by the eigenvalues of the matrix or A^. We 
will call matrix Au stable if all its eigenvalues have negative real part. We call matrix Ac 
stable if all its eigenvalues have negative real part, except one simple eigenvalue o|^ In this 
terminology, the local stability of the system is equivalent to the stability of the matrix A in 
question (either Au or Ac). On the other hand, if A has an eigenvalue with positive real 
part, the ODE has solutions diverging from equilibrium (ip*) exponentially fast; if A has 
(a pair of conjugate) pure imaginary eigenvalues, the ODE has oscillating, never converging 
solutions. 

3.4. Fluid model as a fluid limit. In this section we show that the set of fluid models 



defined in Section 3.1 contains (in the sense specified shortly) all possible limits of "fluid 
scaled" processes. We consider a sequence of systems indexed by r, with the input rates 
being A[ = rAj + o(r), server pool sizes being f3jr, and the service rates fiij unchanged with 



r. Recall the notation in Section 2^ We also add the following notation: 

y4[(t): the number of customers of type i who have entered the system by time t (a 

Poisson process of rate A[) 
S^j{t): the number of customers of type i who have been served by servers of type j if 

a total time rt has been spent on these services (a Poisson process of rate /ij^r) 

Let z G X, and (ij) G S, he independent unit-rate Poisson processes. We 

can assume that, for each r, 

A:{t) = nt\\:t), ^i;(t) = ng)(^,,rt). 

Then, by the functional strong law of large numbers, with probability 1, uniformly on com- 
pact subsets of [0, oo), 

(17) ^A[(t)^A,t, l5i;.(t)^/x,,t. 

We consider the following scaled processes: 

xm = lxiit) qKt) = lQm r^,{t) = l%,it) m = -^mt) aiit) = U:{t). 

Theorem 3.5. Suppose 

{«(0)), (g[(0)), (^[,(0)), (p,^(0))} ^ {(x.(0)), (g.(0)), i^O)), (p,(0))}. 
Then w.p.l any subsequence of {r} contains a further subsequence along which u.o.c, 

{«(■)), <(■)), (■)), (^[,(-)), (P •(■))} ^ {(«.(■)), ^.(O), iM-)), (p. (■))}, 

where the limiting trajectory (in the RHS) is a fluid model. 



Proof. Given property (17), the probability 1, u.o.c, convergence along a subsequence to 



a Lipschitz continuous set of functions easily follows. The only non-trivial properties of a 



fluid model that need to be verified for the limit are (5f). Let us consider a regular time t: 
namely, such that all the components of a limit trajectory have derivatives, and moreover the 
minimums and maximums over any subset of components have derivatives as well. Consider 



matrix A with all eigenvalues having negative real part is usually called Hurwitz. So, stability is 
equivalent to being Hurwitz; while Ac stability definition is slightly different, due to Ac singularity. A 
symmetric matrix A is Hurwitz if and only if it is negative definite, but neither A^ nor Ac is, in general, 
symmetric. 



14 



ALEXANDER L. STOLYAR AND ELENA YUDOVINA 



a sufficiently small interval [t,t + At], and consider the behavior of the (fluid-scaled) pre-limit 



trajectory in this interval. Then, it is easy to check that the conditions (5f ) on the derivatives 



must hold; the argument here is very standard - we omit details. □ 

4. Special cases in which fluid models are stable 

In this section we analyze two special cases of the system parameters, for which we demon- 
strate convergence results. In Section |4.1| we consider the case when there exists a set of 
positive fij, j G JT", such that /i^ = fij for (ij) G S (i.e. the service rate fiij is constant across 
all i G C{j)); we show global convergence of fluid models to equilibrium. In Section 4.2 we 
consider the case when there exists a set of positive /ij, z G X, such that /ijj = /ij for (ij) G S 
(i.e. the service rate fiij is constant across all j G we show local stability of the fluid 

model (i.e. stability of Au and Ac). 

4.1. Global stability in the case /ijj = nj, (ij) G S. We call the system globally stable 
if any fluid model, with arbitrary initial state, converges to an equilibrium point as t — )■ oo. 
(This of course implies pj{t) — )■ p for all j & J and ''pijit) — )■ ip*j for all i & X, j E J . Note 
that, in underload, the deflnition necessarily implies gj(t) — ?■ for all z G X, while in critical 
load it requires qi{t) — )■ q for all i eT and some q > 0.) 

Theorem 4.1. The system with pij = fij, (ij) G S, is globally stable both for p < I and 
for p = I. In addition, the system is locally stable as well (i.e. the matrices Au and A^ are 
stable). 

Proof. Consider the underloaded system, p < 1, flrst. First, we show that the lowest load 
cannot stay too low. Suppose the minimal load p* (t) = mmj pj (t) is smaller than p, and 
let i7*(t) = {j : Pj{t) = p*(t)}. Then all customer types in C{J^{t)) = Ujej,(t)^0) 
routed to server pools in J^^.{t), so the total arrival rate "into" iZ,(t) is no less than nominal; 
on the other hand, since pij = pj and server occupancy is lower than nominal, the total 
departure rate "from" J^^{t) is smaller than nominal. This shows that if p^, < p — e < p, then 
p^ > 6 > 0, where S > ce for some constant c > (depending on the system parameters). 
That is, if p*(t) < p then p^:(t) > c{p — p*(t)), so p*(t) is bounded below by a function 
converging exponentially fast to p. 

Consider a flxed, sufficiently small e > 0; we know that p*(t) > p — e for all large times t. 
If some customer class i has a queue qi{t) > 0, then all server classes j G S{i) have pj{t) = 1. 
It is now easy to see that the system is serving customers faster than they arrive (because 
p < 1 and e is small). This easily implies that all qi{t) = after a flnite time. 

In the absence of queues, we can analyze p* [t) = max^ pj {t) similarly to the way we treated 
p*{t); namely, we show that p*{t) is bounded above by a function converging exponentially 
fast to p, which tells us that pj{t) — )■ p for all j. Once all pj{t) are close enough to p, we 



can use the argument essentially identical to that in the proof of Theorem 3.1| to conclude 



that, after a further flnite time, we will have Pj{t) = Pj'{t) for all j, j'. (The argument 



is even simpler, because, unlike in Theorem 3.1, where it was required that {ipij{t)) were 



close to nominal, here it suffices that {pj{t)) are close to nominal, because of the pij = pj 
assumption.) With p(t) = Pj(t), Vj, we then have for the total amount of "fluid" in the 
system: 

(d/dt) J2 PAt) = J2^^-Y1 f^Ai)l^r 
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This is a simple linear ODE for p{t), which implies that (after a finite time) p{t) — p = 
Ciexp(— C2t), with constant C2 > and Ci. This in particular means that Pj{t) = p{t) — > 0. 
Denote by Xij{t) the rate at which fluid i arrives at pool j, namely 

(18) Xij{t) = p,tij,,{t) + ij,j{ty, 

at any large t we have \ij(t) = Aj. Then, for each j, 



This is only possible if each \ij{t) — )■ Ajj. But then the ODE (18) implies ilJijit) — )■ ip*y 

Now, consider a critically loaded system, p = I. Essentially same argument as above tells 
us that, as long as not all queues qi(t) are equal, each of the longest queues gets more service 
than the arrival rate into it, and so q*{t) = maxgj(t) has strictly negative, bounded away 
from derivative. If all qi{t) are equal and positive, then q*{t) = 0. We see that q*(t) is 
non-increasing, and so q*it) I q > 0. We also have p^:{t) ^ p = 1 exponentially fast. (Same 
proof as above applies.) These facts easily imply convergence to an equilibrium point. We 
omit further detail. 

Examination of the above proof shows that it implies the following property, for both 
cases p < 1 and p = 1. For any fixed equilibrium point (with g > if p = 1), there exists 
a sufficiently small e > such that for all sufficiently small 6 > 0, any fluid model starting 
in the 5- neighbor hood of the equilibrium point, flrst, never leaves the e- neighborhood of the 
equilibrium point and, second, converges to an equilibrium point (possibly different from the 
"original" one, if p = 1). This property cannot hold, unless the system is locally stable (see 



Section 3.3). 



□ 

4.2. Local stability in the case pij = pi, (ij) E 8. 

Theorem 4.2. Assume p < I and ptj = pi for {ij) G £. Then the system is locally stable 
(i.e. Au is stable). 

Proof. We have 

ipi(t) = \i- Pii)i{t) 

and Au is simply a diagonal matrix with entries —pi- □ 

Theorem 4.3. Assume p = 1 and pij = pi for (ij) G £. Then the system is locally stable 
(i.e., Ac is stable). 



Proof. As seen in the proof of Theorem |4.2[ the matrix Au in this case is diagonal with 
entries —pi. By Lemma 3.4, Ac has off-diagonal entries {Ac)ii' = Pi'/I and diagonal entries 
— Pj(l — 1//). That is, its off-diagonal entries are strictly positive. Therefore, Ac + r]I for 
some large enough constant > (where / is the identity matrix) is a positive matrix. 
By Perron- Frobenius theorem Chapter 8] , Ac + rjl has a real eigenvalue p + r] with the 
property that any other eigenvalue of Ac + r]I is smaller than p -|- 77 in absolute value (and 
in particular has real part smaller than p + t]). Moreover, the associated left eigenvector 
w is strictly positive, and is the unique (up to scaling) strictly positive left eigenvector of 
Ac + T]I. Translating these statements to Ac, we get: Ac has a real eigenvalue p; all other 
eigenvalues of Ac have real part smaller than p; Ac has unique (up to scaling) strictly positive 
left eigenvector w; and the eigenvalue of w is p. 
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Now, Ac has a positive left eigenvector with eigenvalue 0, namely (1, 1, ... , 1). Therefore, 
we must have p = 0, and we conclude that all other (i.e., non-zero) eigenvalues of A^ have 
real part smaller than 0, as required. □ 

5. Fluid models for general Hij: local instability examples. 

In Sections |4.1 4.2 we have shown that the matrices Au and Ac are stable in the cases 
fiij = /ij, (ij) G £ and /ijj = /Xj, (ij) G £. Since the entries of A^, Ac depend continuously on 
Hij via Lemmas 3.3, 3.4, and the eigenvalues of a matrix depend continuously on its entries, 
we know that the matrices will be stable for all parameter settings sufficiently close to those 
special cases. Therefore, there exists a non-trivial parameter domain of local stability. One 
might consider it to be a reasonable conjecture that local stability holds for any parameters. 
It turns out, however, that this conjecture is false. We will now construct examples to 
demonstrate that, in general, the system can be locally unstable. 

Remark 5.1. In examples below, we will specify the parameters fi^ and sometimes but 
not Ax. It is easy to construct values of Ax which will make all of the activities in S basic; 
simply pick a strictly positive vector ips, such that all loads '^i'lpij/f^j are equal, and set 
A,: = y^„- i Mil. Lemmas 3.3 

A 



affect the matrices A. 



iii) and 3.4 (iii) guarantee that the specific values of Ax do not 



In critical load, we also do not need to specify fij. 



Local instability example 1. Consider a system with 3 customer types A, B, C and 
4 server types 1 through 4, connected 1— A — 2 — B — 3 — C — 4. Set (3i = 0.97 and 
/32 = /33 = /34 = 0.01. Set /xai = fJ^B2 = fJ'Cs = 1, and fiA2 = f^Bs = fJ^CA = 100. (See 
Figure |2}) On the other hand, we compute by Lemma 3.3 



o o o 



10(\ / 10(\ / 10ft 



0.97 0-01 0.01 0.01 



Figure 2. System with three customer types whose underload equilibrium is unstable 

4.99 -0.99 -0.99^ 
Au = I 97.02 -2.98 -1.98 
96.03 96.03 -3.97, 



with eigenvalues { — 17.8,4.45 ± 23. 4z}. Therefore by Theorem 3.1, the system with these 
parameters is described by an unstable ODE in the neighborhood of its equilibrium point. 

We now show that this is a minimal instability example, in the sense made precise by the 
following 

Lemma 5.2. Consider an underloaded system, p < 1. 

(i) Let I > 2. Any customer type i that is a leaf in the basic activity tree, does not affect 
the local stability of the system. Namely, let us modify the system by removing type i, and 
then modifying (if necessary) input rates Xk of the remaining types k &X\i so that the basic 
activity tree of the modified system is S\ (ij), where (ij) is the (only) edge in £ adjacent to 
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i. Then, the original system is locally stable if and only if the modified one is. 
(a) A system with two (or one) non-leaf customer types is locally stable. 

Proof, (i) If type z is a leaf, the equation for ipiit) is simply ipi{t) = \i — ^ijipi{t). This means 
(setting i = 1) that (1,0,..., 0)''" is an eigenvector of with eigenvalue —jiij. Further, it is 
easy to see that: (a) the rest of the eigenvalues of A^ are those of matrix obtained from 
Au by removing the first row and first column; and (b) is exactly the "y4„-matrix" for 

the modified system. 

(ii) We can assume that there are no customer type leaves. The case / = 1 is trivial (and 



is covered by Theorem 4.1), so let 1 = 2. Throughout the proof, the pool sizes (3j are fixed. 
From Theorem 4.1 we know that for a certain set of service rate values (namely, /ijj = fij, 
iij) e the matrix A^ is stable. Suppose that we continuously vary the parameters 
IXij from those initial values to the values of interest, without ever making iXij = 0. If we 
assume that the final matrix Au is not stable, then as we change /ijj the (changing) matrix 
Au acquires at some point two purely imaginary eigenvalues. If the eigenvalues of A^ are 
purely imaginary, we must have trace (A^) = 0. However, as seen from the form of A^ in 



Lemma 3.3, the diagonal entries of Au are always negative, and therefore trace (y4„) < 0. 
The contradiction completes the proof. □ 

An argument similar to the above proof also allows us to explain how the instability 
example 1 was found. In degree 3, let the characteristic polynomial of Au be — C2X^ + 
C\x — cq. a necessary and sufficient condition for all roots of the polynomial to have negative 
real parts is: — C2,ci,— cq > and C2C1 < cq (see [31 Al.l.l]j. A necessary and sufficient 
condition for the "boundary case" between stability and instability (i.e. the condition for a 



pair of conjugate purely imaginary roots) is C2C1 = cq. Using Lemma 3.3 we can evaluate the 
characteristic polynomial symbolically and use the resulting expression to find parameters 
for which C2C1 = Cq will hold. See online [TJj for the computations. 

It is possible to construct an instability example with more reasonable values of Pj, fiij, 
although it will be bigger. Figure |3] shows the diagram. The associated matrix Au and its 
eigenvalues can also be found online [1] 



i i i i i i i i i i i i i i i i i i i i i 

000000000000000000000 



Figure 3. System with I3j = 1 and /ijj G {1/3,1,3} whose underload equi- 
librium is unstable. There are 21 customer types. 

We do not have an explicit characterization of the local instability domain, beyond the 
necessity of / > 3. 

We now analyze the critically loaded system p = 1 with queues, i.e. the stability of the 
matrix A^. Recall that the transformation A^ restricted to subspace {y \ YliVi ~ 0}' ^^"^ 
then the stability of A^ does not depend on the values of f3j, so it suffices to specify the 
values fiij. 

Local instability example 2. Consider the network of Figure |4], which has 5 customer 
types A through E and 4 server types 1 through 4, connected A — 1—B — 2 — C — 3 — D — A — E, 
with the following parameters: 
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f^D3 



100 
100 



fJ'B2 = 1 
/iD4 = 10000 



100 
100 



o o o o o 



Figure 4. System with five customer types whose critical load equilibrium is unstable 



The matrix Ac, computed from Lemma 3.4 will be given by 



Ar 



1 

20 



V 



9389 
10894 
10399 
-40091 

9409 



9805 
9290 
10795 
-39695 
9805 



10201 
9706 
9191 
-39299 
10201 



10597 
10102 
9607 
-40903 
10597 



-29003\ 
-29498 
-29993 
119497 
-31003 J 



and the eigenvalues of A^ are {0, -16.88, -2190.05, 2.565 ± 23.23i}. 
Again, the above example 2 is in a sense minimal: 

Lemma 5.3. Consider a critically loaded system, p = 1. 

(i) Let J > 2. Any server type j that is a leaf in the basic activity tree does not affect the 
local stability of the system. Namely, let us modify the system by removing type j , and then 
replacing Aj for the unique i adjacent to j by Aj — (Sjfiij . Then, the original system is locally 
stable if and only if the modified one is. 

(a) Consider a system labeled S. We say that a system S' is an expansion of system S if it is 
obtained from S by the following modification. We pick one server type j and one customer 
type i adjacent to it in S; we "split" type j into two types j' and j" ; we "connect" type i to 
both j' and j" ; each of the remaining types i' G C(j) \ i we connect to either j' or j" (but not 
both); if {i'i') (resp. {i'j")) is a new edge, we set fii'j' = fii'j (resp. fii>j" = fii'j.) Then, S is 
locally stable if and only if S' is. 

(Hi) A system with four or fewer customer types is locally stable. 

Proof, (i) The argument here is a "special case" of the one used to show the independence 
of transformation Ac (restricted to (/ — l)-dimensional invariant subspace) from {/3j) in the 
proof of Lemma 3.4 Namely, it is easy to check that the original and modified system share 



exactly same ODE (15) 
(ii 



Again, it is easy to see that the two systems share the same ODE (15) 



(iii) We can assume that there are no server-type leaves, so that the tree £ has only customer- 
type leaves, of which it can have two, three, or four. 

If it has four customer type leaves, then the tree has a total of four edges, hence five nodes, 
i.e. a single server pool, to which all the customer types are connected. 

If the tree has three customer type leaves, then letting k be the number of edges from the 
fourth customer type, we have k + 3 total edges, so k + 4 nodes, of which k are server types. 
That is, the non-leaf customer type is connected to all of the server types. Since there are 
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no server type leaves, we must have k < 3; since we are assuming the fourth customer type 
is not a leaf, we must have k > 2; thus, k = 2 or k = 3. 

The last case is of two customer type leaves. Letting k, I be the number of edges coming 
out of the other customer types, we have k + I + 2 edges. On the other hand, since each 
server type has at least 2 edges coming out of it, we have at most {k + l + 2)/2 server types, 
so at most (fc + Z + 2)/2 + 4 nodes. Thus, we have {k + I + 2) + 1 < (A; + Z + 2)/2 + 4, or 
k + I + 2 < 6, giving k = I = 2 (since they must both be > 2). 

We summarize the possibilities in Figure |5} Note that the bottom-left system can be 
obtained by a sequence of expansions from each of the top-left systems, and so this is the 
only system we need to consider to establish local stability for all 3- and 4-leaf cases. Thus, in 
total, the only two systems that need to be considered are bottom-left and right. In both of 



oooo 

2 leaves 



OOOO OOO/O ooo^o 



4 leaves 3 leaves 

000^0 

3 or 4 leaves 



Figure 5. Possible arrangements of four customer types. 



the resulting cases, we can use Lemma 3.4 to write out and its characteristic polynomial 
explicitly. The characteristic polynomial will have degree 4, but one of its roots is 0, so we 
can reduce it to degree 3. We then symbolically verify that the cited above stability criterion 
[HI Al.1.1] for degree 3 polynomials, is satisfied. See online ^H] for the details. □ 

An argument similar to that in the above proof allows us to explain how the instability 
example 2 was found. We seek a condition satisfied by the coefficients of a degree 4 poly- 
nomial with two imaginary roots. Letting the polynomial be x"^ — c\x^ + C2X^ — c^x + C4, 
and letting the roots be rji, 772, ±iz (where rji and ri2 may be real or complex conjugates, 
and z G M), we see that Ci = ?7i + ri2, C2 = 77i?72 + z"^, C3 = {rji + r]2)z^, and C4 = rjirj2z'^. 
This implies the relation Cic\ + Cg — C1C2C3 = 0, and we can find the parameters for which 
this is true. (The symbolic calculation will involve rather a lot of terms.) We remark that, 
whereas for degree 3 polynomials the condition C2C1 — cq = is necessary and sufficient for 
the existence of two imaginary roots |3l Al.1.1], the condition we derive here for degree 4 
polynomials is necessary, but not sufficient. (For example, the polynomial (x — l)^(x + 1)^ 
has ci = C3 = 0, so Cic\ + c| — C1C2C3 = 0, but it has no imaginary roots.) Thus, checking the 
sign of the corresponding expression alone is insufficient to determine whether the system is 
unstable, but is a useful way of narrowing down the parameter ranges. 

Finally, it is possible to construct a single system which will be unstable both for p < 1 and 
for p = 1 with positive queues. For the local stability of the underloaded system, the leaves 
of the basic activity tree corresponding to customer types are irrelevant (the corresponding 
occupancy on the sole available server class converges to nominal exponentially). On the 
other hand, for the critically loaded system, the leaves corresponding to server pools are 
irrelevant, since the corresponding server is fully occupied by its unique available customer 
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type. This observation allows us to merge the above two systems into a single one which is 
unstable both in underloaded and in the critically loaded case. 

Consider a system with 5 customer types A through E and 5 server types through 4 
connected asO — A — 1 — B — 2 — C — 3 — D — A — E, with fiAo = 100 and the remaining 
Hij as in the critically loaded case. Set /^s = 0.96 while (3o, (32, (^^ = 0.01. (See Figure [6]) 
By the above discussion, this system must be unstable for p = 1 and positive queues. We 

o o o o o 

■/\/\/\/"\/ 



Figure 6. System with five customer types whose underload and critical load 
equilibrium points are both unstable 

therefore need to consider only the first 4 customer types {E is a customer type leaf and 



in underload. 
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and eigenvalues are { — 14.6, —201.1,3.91 ± 18.12}. 

While we showed above that sufficiently small systems are at least locally stable, we will 
show now that, in the underload case, any sufficiently large system is locally unstable for 
some parameter settings. 

Lemma 5.4. In underload (p < 1), any shape of basic activity tree that includes a locally 
unstable system (i.e., with A^ having an eigenvalue with positive real part) as a subset will, 
with some set of parameters (Pj), {pij), become locally unstable. In particular, any shape 
of basic activity tree that includes instability example 1 (Figure^ above (for p < 1) will be 
locally unstable for some set of parameters Pj, pij. 

Proof. Let U be any system whose underload (p < 1) equilibrium is locally unstable, e.g. 
one of the examples given above, with the associated fixed set of parameters pij, f3j and Aj. 
Let S' be a system including ?7 as a subset, namely: the activity tree of S' is a superset of 
that of U; the pij and f3j in U are preserved in S; the fiij in 5" are fixed. Consider a sequence 
of systems S**^ in which /3j = e — )■ for all j not in U. For each e, take so that all of the 
activities are indeed basic, and such that, as e — )■ 0, Af — )■ A,- for i in f/, and Af — )■ for i not 



in U. (See Remark 5.1 ,) Order the ijji so that the customer types i inU come first. Suppose 
there are / customer types in U and I + k customer types in S. Let be the (I + k) x {I + k) 
matrix associated with S**^, and let A^ be the I x I matrix associated with U considered as 
an isolated system. Then as e — )■ the top left I x I entries of converge to A^, while the 
bottom left k x I entries of converge to 0. (That is, the effect of U on the stability of 
the rest of the system vanishes - this is due to the fact that pool size parameters (3j in U 
remain constant, while /3j — )■ in the rest of the system.) Consequently, each eigenvalue of 



LOAD BALANCING INSTABILITY 



21 



Au is a limit of eigenvalues of A^. Since had an eigenvalue with positive real part, for 
sufficiently small e the matrix will have at least one eigenvalue with positive real part as 
well, so the system S"^ will be locally unstable. □ 

6. Diffusion scaled process in an underloaded system. Possible 
evanescence of invariant distributions 

Above we have shown that on a fluid scale, around the equilibrium point, the system 
converges to a subset of its possible states, on which it evolves according to a differential 
equation, possibly unstable. This strongly suggests that, when the differential equation is 
unstable, the stochastic system is in fact "never" close to equilibrium. Our goal in this section 
is to demonstrate that it is the case at least on the diffusion scale. More precisely, we consider 
the system in underload, p < 1, and look at diffusion-scaled stationary distributions (centered 
at the equilibrium point and scaled down by \^); we show that, when the associated fluid 
model is locally unstable, this sequence of stationary distributions is such that the measure 
of any compact set vanishes. 

6.1. Transient behavior of diffusion scaled process. State space collapse. In this 
section we cite the diffusion limit result (for the process transient behavior) that we will need 
from [7]. Again, we consider a sequence of systems indexed by r, with the input rates being 
A[ = rXi, server pool sizes being (3jr, and the service rates /ijj unchanged with r. (Here we 
drop the o(r) terms in A[ = rAj + o(r), because, when p < 1, considering these terms does 
not make sense.) The notation for the unsealed processes is the same as in the previous 
section; however, we are now interested in a different - diffusion - scaling. We define 

(19) %{t) = Mi^, ^r^t) =j:mt), ^it) =j:mt) = 

j i ^ 

We will denote by M' the linear mapping from z = {zij, 

given by X]j % = Vi- (So, (^K^)) = ^'(^ij(^))-) There is the obvious relation between 
M' and the operator M defined by ([s]): M'My = y for any ?/ G M^. Let us define Ai : = 
{My I y G M^}, an /-dimensional linear subspace of R-'^"'''^"^; equivalently, Ai = {z E 
z = MM'z}. 



pi+J-i 



Theorem 6.1 (Essentially a corollary of Theorem 3.1 and Theorem 4.4 in [7J). Let p < 1. 

Assume that as r — )■ oo, ^^^(0) — )■ ^^{O) where \l/£-(0) is deterministic and finite. (Conse- 
quently, ^5;(0) -» ^x(O) = M'^£(0).; Then, 

(20) ^J(-) =^ ^x(-) znD'[0,oo), 
and for any fixed t] > 0, 

(21) %{■) =^ M^x{-) in D^+-^-^[r],oo), 
where \i/x(-) is the unique solution of the SDE 

(22) ^i{t) = ^,(0) - J2 I {Mi>x{s)),jds + ^/2\iBi{t), i G X, 
and the processes Bi{-) are independent standard Brownian motions. 
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Recalling the definition of matrix (see (10)), (22) can be written as 

(23) ^x{t) = ^x(O) + / A^^x{s)ds + {^,Bi{t)). 

Jo 



The meaning of Theorem 6.1 is simple: the diffusion limit of the process ^E'x(-) is such that, 
at initial time 0, it "instantly jumps" to the state MM''^£[0) on the manifold Ai (where 
MM'^£-(0) = ^£-(0) only if ^£-(0) G M); after this initial jump, the process stays on M 



and evolves according to SDE (23). Theorem 6.1 is "essentially a corollary" of results in 



[7], because the setting in [7J is such that p = 1, while we assumed p < 1. However, our 



Theorem |6 . 1 1 can be proved the same way, and in a sense is easier, because when p < 1, the 
queues vanish in the limit (which is why the queue length process is not even present in the 
statement of Theorem 6.1[ ). 



6.2. Evanescence of invariant measures. In this section we show that if the matrix Au 
has eigenvalues with positive real part, the stationary distribution of the (diffusion scaled) 
process "^^i-) escapes to infinity as r — )■ cxd. Namely, we prove the following 



Theorem 6.2. Suppose p < I. Consider a sequence of systems as defined in Section 6J_ 
and denote by p^' the stationary distribution of the process '^^i-), a probability measure on 
M.^^"^'^. Let Bk = {1^1 ^ K} C M^"'"'^"^. Suppose the matrix Au has eigenvalues with positive 
real parts and no pure imaginary eigenvalues^ Then for any K , p'^^bx) —^0 as r —)■ oo. 

Before we proceed with the proof, let us introduce more notation and one auxiliary result. 
Let Cx is the submanifold of convergence (stabihty) of ODE {d/dt)y = Auy on IR-^; namely, Cx 
is the (real) subspace of spanned by the Jordan basis vectors for matrix Au corresponding 
to all eigenvalues with negative real parts. Given assumptions of the theorem on Au, the 
solutions to {d/dt)y = Auy converge to exponentially fast if y{0) G Cx, and go to infinity 
exponentially fast if y{0) G \ Cx- Let C = MCx denote the corresponding submanifold of 
convergence (stability) of the linear ODE {d/dt)z = {M AuM')z on z G M.. This ODE is just 
the M-image of ODE {d/dt)y = Auy. Therefore, a solution z{t) converges to exponentially 
fast if z{0) G C, and goes to infinity exponentially fast if z{0) E Ai \ C. Let us denote 
&_ft'(5i,52) := bx n {d{z,Ai) < 6i,d{z,C) > 62}, where d{-, ■) is Euclidean distance. 



Lemma 6.3. Solutions to SDE \23^ have the following properties, 
(i) For any T > and any \l'j(0), 

P{M^x(T) G \ C} = 1; 

(a) For any K > 0, S2 > and e > 0, there exist sufficiently large Tk and K' > K, such 
that, uniformly on M\l/i(0) G &_ft'(0,52), 

P{M^x(T^) G bj,, \ b2K} > 1 - e. 

Proof. Statement (i) follows from the fact that, regardless of the (deterministic) initial state 



\E'i(0), the solution to SDE (23) is such that the distribution of \E'x(T) is Gaussian with 



non-singular covariance matrix. (See |B1 Section 5.6]. In our case the matrix of diffusion 



^The requirement of "no pure imaginary eigenvalues" is made for convenience of differentiating between 
strict convergence and strict divergence. It holds for generic values of /3j, Hij: that is, any set of values /3j, 
fiij has a small perturbation f3j, fllj with for which has no pure imaginary eigenvalues. 
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coefficients is diagonal with entries \/2Ai.) Therefore, the probabihty that \&x(T) is in a 
subspace of lower dimension is zero. 

Statement (ii) follows from the fact (again, see [HI Section 5.6]) that the expectation 
m{t) = E\E'j(t) evolves according to ODE 

■m{t) = Aum{t). 

Since (i(M^j(0),C) > ^2 (and thus ^x(O) is also separated by a positive distance from Cx), 
we have 

\m(t)\ > ai exp{at) 

for some fixed ai,a > and all large t. (Here ai depends on the minimum length of 
the projection of ^x(O) along Cj onto the (real) span of the Jordan basis vectors of Au 
corresponding to eigenvalues with positive real part, and a is the smallest positive real part 
of an eigenvalue of Au.) It is easy to check that if the mean of a Gaussian distribution goes to 
infinity, then (regardless of how the covariance matrix changes) the measure of any bounded 
set goes to zero. On the other hand, both m{t) and the covariance matrix remain bounded 
for all t G [0,Tx], with any T^; then, for any Tx, we can always choose K' large enough so 
that P{M^x(T/^) e bx'} is arbitrarily close to 1. □ 

We are now in position to give a 

Proof of Theorem \6.S\ We will consider measures /i'' as measures on the one-point compact- 
ification M" = M" U {*} of the space M", where n = I + J —1. In this space, any subsequence 
of {/i*^} has a further subsequence, along which ^ for some probability measure /i on 
M". We will show that the entire measure /i is concentrated on the infinity point *, i.e. 
/i(]R") = 0. Suppose not, i.e. /i(]R"') > 0. The proof proceeds in two steps. 

Step 1. We prove that /x(]R") = fi{A4 \ C). Indeed, let us choose any e > 0, and K 
large enough so that ^{hK/2) > (1 — t)^{W^). Then, for all large r, ^'^{bx) > (1 — e)/i(]R"). 
Choose Si > and T > arbitrary. From the properties of the limiting diffusion process 



(Lemma 6.3), we see that we can choose a sufficiently small ^2 > and sufficiently large K' 
such that, uniformly on the initial states ^^(0) G bx, 

liminf P{^^(T) G bx'{6i,62)} > 1 - e. 

r— >oo 

This implies that for all large r, 

fi'{bx'{Si,62))>{l-effi{R^), 

and then fi{bx'{Si,S2)) > (1 — e)'^ fi(M."') . Since e and Si were arbitrary, we conclude that 
/i(]R") < \ C), and then, obviously, the equality must hold. 

Step 2. We show that, for any K > 0, ^{W^\bx) = /i(M"). (This is, of course, impossible 
when yu(]R") > 0, and thus we obtain a contradiction.) It suffices to show that for any e > 0, 
we can choose a sufficiently large K, such that /i(]R" \ bx) > (1 — e)^/i(]R"). Let us choose 
(using step 1) a large K and a small S2 > 0, such that fi{bx/2{Si/2, 2S2)) > (1 — e)/^(I^") for 
any Si > 0. Then, for any fixed Si > 0, for all large r, fi^{bx{Si,S2)) > (1 — e)/i(]R"). Now, 



using Lemma 6.3[ ii), we can choose K' and Tx sufficiently large, and then Si sufficiently 



small, so that, uniformly on the initial states '^£{0) G bxiSi,S2), 

liminf P{^^(Tk) G bx' \ &2i^} > I - e. 
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Therefore, 

fi'^{bK'\b2K) > (l-e)V(M") 
for all large r, and then for the limiting measure n we must have /i(M'^ \ ^k) > (1 ~ 
e)V(K")- □ 

7. Diffusion scaled process in an critically loaded system, in Halfin-Whitt 

asymptotic regime. 

In this section we consider the following asymptotic regime. The system is critically 
loaded, i.e. the optimal solution to SPP ([T| is such that p = 1. As scaling parameter 
r — )■ oo, assume that the server pool sizes are r/3j (same as throughout the paper), and the 
input rates are A[ = rAj + ^/rli, where the parameters (finite real numbers) {/j} are such 
that — ~^ < 0- Denote by p'^,{A[j} the optimal solution of SPP ([T]), with /3j's and 

Aj's replaced by r/3j and A[, respectively. (This solution is unique, as can be easily seen from 
the CRP condition.) Then, it is easy to check that = 1 + (^ /jZ/j) / y/r = 1 — C/y/r, which 
in turn easily implies that, for any r, the system process is stable with the unique stationary 
distribution. 

We use the definitions of ( [l9| for the diffusion scaled variables, and add to them the 
following ones: X[(t) = {Xl(t) — 4'*r)/y/r for the (diffusion-scaled) number of type i cus- 
tomers; Qiit) = Q^{t)/y/r for the type i queue length; Zj(t) = Zj(t)/y/r, where Zj(t) = 
"^^jit) — rPj < is the number of idle servers of type j (with the minus sign). Note that, 
although the optimal average occupancy of pool j is at p'^r(3j, the quantity Zj(t) measures 
the deviation from full occupancy r(3j. Our choice of signs is such that Qi>0 while Zj < 0. 

We will use the vector notations, such as X^it), as usual. 

Two main results of this section are as follows: (a) it is possible for the invariant distribu- 
tions to escape to infinity under certain system parameters and (b) in the special case when 
service rate depends on the server type only, the invariant distributions are tight. 

7.1. Example of evanescence of invariant measures. Recall that vr denotes the (ma- 
trix of) orthogonal projection on the subspace L = {y E \ YliVi ~ 0} ^^^^ 
the projection "along" the direction of vector (1, . . . , 1)"''. Also recall the relation between 
matrices and A^. 

Ac = T^Ay^. 

One more notation: for y G M'^, 

Analogously to Theorem |6.1[ the following fact is a corollary (this time - direct) of Theorem 
3.1 and Theorem 4.4 in [7]. 

Theorem 7.1. Assume that as r ^ oo, X^(0) Xx(0) and ^^(0) (0), where Xx{0) 

and \i/£:(0) are deterministic and finite. Then, 

(24) Xii-) =^ Xx(-) znD'[0,oo), 
and for any fixed f] > 0, 

(25) %{■) =^ MF[Xx(.)] ^nD'+■'-'[v,oc), 
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where Xx{-) is the unique solution of the SDE 

(26) Xxit) = Xx(0) + / AuF[Xx{s)]ds + (v^5,(t)), 



and the processes Bi{-) are independent standard Brownian motions. 

Next we establish the following fact. 

Lemma 7.2. There exists a system and a parameter setting such that the following holds, 
(i) Matrix is unstable. 

(a) Matrix A^ has (1, . . . , 1)"'" as a right eigenvector, with real non-zero eigenvalue c: 
(27) A„(l,...,l)t = c(l,...,l)t. 

Proof. Let us start with the system in the local instability example 2 (see Figure |4]) for the 
critical load. We will modify it as follows. We will change from 100 to 100 — e with 
sufficiently small positive e, so that A^ remains unstable. (The reason for this change will be 
explained shortly.) We will add two new server pools, and 5, on the left and on the right, 
respectively, and set fiAo = 100, /i^s = 1; such addition of server- leaves does not change the 
instability of Ac- So, (i) holds. 

Now, suppose all Aj are equal, say Aj = 1. We can choose ip*j such that all ip* = J2j i^ij 
are equal, and Ylj f^iji^ij = -^j = 1 ior all i. Namely, we do the following. The reason for 
changing fiD^ from 100 to 100 — e is to make it possible to choose ip}j^ > and > 0, such 
that f^Djihj = 1 and = V'ds + ^P*d4 > 1/100- We choose = 1/100 - 6, = 1005 
(which guarantees J2j f^Aji^Aj = 1) with 6 > small enough so that ipA = 1/100 + 995 < 
1/(100 — e). The values of pairs ('?/'bi; "^52); (^^02; "^cs); (^^£4; V^es); are chosen to be equal 
to (V'ao' V'ai)- Finally, we choose V'ds = (1 ~ 5i)/(100 — e) and iphi — ^1/^^^ (which ensures 
Z]j f^Dji'hj = 1) wi^h 5i > satisfying 

= (1 - 5i)/(100 - e) + (5i/10^ = 1/100 + 996 = 

This completes the choice of iply 

We set Pj = Ylii^ij- We see that {iplj) is the equilibrium point. It follows from the 
construction that (27) will hold for Au. Indeed, if ipi — ipx = ci(l, . . . , 1)''', then ipx = 
C2'0x, which in turn means that tps = C2'ip£', therefore, the corresponding service rates are 
f^iji'ij = C2 Zlj f^iji'ij = C2Ai = C2 for all i; therefore, ipx = (l - C2)(l, . . . , 1)^ □ 



Theorem 7.3. Suppose we have a system with parameters satisfying Lemma 1/2,, in the 



Halfin-Whitt regime, described in this section. Then, the sequence of stationary distributions 
of Xj (and of'^j) escapes to infinity: the measure of any compact set vanishes. 

Proof. Since (1, . . . , 1)"'' is an eigenvector of Au, for any y eM.^ we have: 

7cAuF[y] = TvAuTiy = A^ny. 

Then, taking the vr-projection of equation (26), we see that irXx satisfies the following linear 
SDE 

(28) nXxit) = irXxiO) + [ A,nXx{s)ds + 7r(v^5,(t)). 



Given instability of linear equation (28), we can repeat the argument of Section 6.2 



to 



show that the sequence of projections of the stationary distributions of Xj on L escapes to 
infinity. □ 
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7.2. Tightness of stationary distributions in the case when service rate depends 
on the server type only. In this section we consider a special case when there exists a 
set of positive rates {/ij}, such that /i^ = fij as long as (ij) G £. We demonstrate tightness 
of invariant distributions. (An analogous result holds for the underload system, p < 1, as 
sketched out at the end of this section.) This, in combination with the transient diffusion 
limit results, allows us to claim that the limit of invariant distributions is the invariant 
distribution of the limiting diffusion process. 

Theorem 7.4. Suppose fiij = fij, {ij) G S and p = 1. Consider a system under the 
LQFS-LB rule in the asymptotic regime defined above in this section. Then, for any real 



^.Ai + (maXj/ij)^^./3/ 
the stationary distributions are such that 

limsupE[^exp(^Q[) + /3j- exp(^Z]y/3j)] < oo. 

r 

* J 

Proof. Note that the statement is trivial for ^ = 0. Also, for ^ > each term exp{6Zj / I3j) 

is bounded so has finite expectation, while for 6^ < each term exp{6Ql) is bounded so has 
finite expectation. 

Our method is related to that in f^. (The exposition below is self-contained.) 

Step 1: preliminary bounds. Consider the embedded Markov chain taken at the 
instants of (say, right after) the transitions. We will use uniformization, that is, we keep 
the total rate of all transitions from any state constant at a^r = A[ + Y2j fPjf^*^ where 
fi* = max/ij-; note that, as r — )■ oo, a'' — > a* = + YljPjl^*- The transitions are of 

three types: arrivals, departures, and virtual transitions, which do not change the state of 
the system. The rate of a transition due to a type i arrival is A[; for the service completion 
at pool i the rate is Pjir^j + ZJ) (recall < 0); and a virtual transition occurs at the 
complementary rate a^r — YliiK ~ l^ji'^^Pj + -^J)- (Obviously, the probability that a 
transition occurring at a transition instant has a given type is the ratio of the corresponding 
rate and a^r.) The stationary distribution of the embedded Markov chain is the same as 
that of the original, continuous-time chain. 

In the rest of the proof, r G {0,1,2,...} refers to the discrete time of the embedded 
Markov chain. 

We will work with the following Lyapunov function 

(29) C{t) ■.= Y,eMOQ\{r)) + /3, exp(^Z;(r)//3,). 

Throughout, we use the bound 

(30) exp(%) < exp(0x) + e{y - x) + ^e^{y - x)^ exp(0 \y - x\ 

which arises from the second-order Taylor expansion of exp{6y). 

A priori we do not know that E[£(r)] exists for 6 > 0. Indeed, while Zj{t) is bounded for 

any r (above by and below by —(3j\/r), the scaled queue size Qlit) is unbounded. To deal 
with this, we also consider the truncated Lyapunov function = min{£, K}. 
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In the equation below, let x denote the variable of interest (either Ql or Zj/f3j), and let 



S{t) denote the state of the embedded Markov chain at time r. From (30) we obtain 



E[exp{ex{T + 1)) - exp{ex{T))\S{T)] < 

exp{ex{T)){eE[x{T + 1) - x{t)\S{t)] + 

h^E [{x{r + 1) - x(r))2 exp(0 |x(r + 1) - x(r)|)|^(r)]). 

Since for both and Ql the change in a single transition is bounded by I/a/t, we conclude: 

(31) E[exp{eQ:{T + 1)) - exp{eQl{r))\S{r)] < 

expieQ:{T)){9EmT + l)-Q:{T)\S{r)]+(h'exp{9/V^)] ^), 



(32) E[/3,exp(^Z;(r + l)//3,)-/3,exp(^Z;(r)//3,)|5(r)] < 

exp(^Z;(r)//3,)(^E[ZJ(r + 1) - ZJ(r)|5(r)] + ( —6' exp{e / V^)) ]-) 

Clearly, as long as values of 9 are bounded, 
on C2) large r, the second summands in (31) and (|32j) are upper bounded by C2|6'^^ and 
-^C2^9'^l, respectively, where = mmj Note that the second bound is independent of 
J- 

Next, we will obtain an upper bound on the drift 



(3,2 7 r' 

br any fixed C2 > I and all sufficiently (depending 



E[£(r + l)-£(r)|S(r)]. 

To do that, we introduce an artificial scheduling/routing rule, which acts only within one 
time step, and is such that the increment C{t + 1) — C{t) under this rule is "almost" a 
(pathwise, w.p.l) upper bound on this increment under the actual - LQFS-LB - rule. (It is 
important to keep in mind that the artificial rule is not a rule that is applied continuously. 
It is limited to one time step, and its sole purpose is to derive a pathwise upper bound on 
the increment £(r + 1) — C{t) within one time step.) 

Step 2: Artificial scheduling/routing rule. We will use the following notation: X+ = 
X+(r) := {t : QUr) > 0}, Xq = Xo(r) := {t : QUr) = 0}, = J^{r) := {j : %(r) < 0}, 

Jo = Mr)-.= {j:Z;(r) = 0}. 

Scheduling: Departures from servers j G J71 are processed normally, i.e. reduce the 
corresponding Zj{T) by 1. Whenever there is a departure from a server pool j G JTq, the 
server takes up a customer of type i with probability A^^Xli-^ij; keeping Zj{t + 1) = 
and reducing Q'^^t + 1) = Q^{t) — 1. However, if it happens that the chosen i is such that 
Q[(r) = 0, i.e. i G Xq, then we keep Q''^{t + 1) = Qiir) = and instead allow ZJ(r + l) = —1. 

Routing: Arrivals to customer types z G X+ are processed normally, i.e. increase the 
corresponding Qiir) by 1. Whenever there is an arrival to a customer type i G Xq, it 
is routed to server pool j with probability X^j/X^, keeping Q^i^r + 1) = Qi{r) = and 
increasing Zj{t + 1) = Zj{t) + 1. However, if it happens that the chosen j is such that 
ZJ(r) = 0, i.e. j G j7o, then we keep Zj{t + 1) = Zj{t) = and instead allow Q^{t+ 1) = 1. 
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Step 3: One time-step drift under the artificial rule. For z G X+, 



nQ:{r + 1) - Ql{T)\S{r)] = (^X: - 5^(/.,r/3, 

or, recalling that 

(33) Yl ^^^3 = ^'^f^M = f^jPAl - Cl^r) 

k 

we obtain 
(34) 



J2k -^kj 



E[g[(r + 1) - g[(r)|S(r)] = _£^i±^^ , ^ X, 



where o(l) is a fixed function, vanishing as r — oo. 

If Qiir) = (i.e. i G Xq), and a new type i arrival is routed to pool j with Zj{t) < 

(i.e. j G J71), then of course Ql stays at and Qi^r + 1) — Qi^r) = 0. However, if a 
new type i arrival has to be routed to j G j7o, then (by the definition of artificial rule) 
QI{t + 1) - QI{t) = QI{t + 1) = Thus, we can write: 



(35) 



K 1 



E[g[(r + 1) - Q[(r)|5(r)] = ^ — , . G Xo 



Note that the RHS of (35) is of order l/y/r, not 1/r. However, we will see shortly that order 
terms in E[£(r + 1) — C{t)\S{t)] cancel out, and this expected drift is in fact of order 

1/r. 



The treatment of the drift of is similar (and again makes use of (33)). We obtain: 
(36) 



E[Z;(r + 1) - ZJ(r)|^(r)] = -^/i,(^;(r) + /3,C) j G 



(37) 

E[Z;(r + l)-Z;(r)|5(r)] 



We can rewrite (37) as 



(38) E[ZJ(r + 1) - ZJ(r)|5(r)] = - E " ^ ^^'^ ' ^ "^'^ ^ ^o, 

where o(l) is a fixed function, vanishing as r — t- oo. 

Note that if £(r) > K then £^(r + 1) - £^(r) < 0, and if £(r) < K then £^(r + 1) 



^ (t) ^ -^(t + 1) ~ -^(t)- Putting together this observation and equations (31) - (32), (34) 



38^, we obtain 
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(39a) E[£^(r + 1) - £^(r)|^(r)] < 

(39b) ^cir)<K} {J2 eMOQKrM-^^'^^ + ^^^^^^ ^ 



(39c) + ^ ^^a: 



1 1 



(39d) + ^ exp(#Z;(r)/ft)»|-^]|zj(^) + /J^c] 

(39e) 9|-A. 1 1 C'A.d+od))!, 



(39f) +E^^P(^^^(^))(?^')^ 



(39g) +eI^-p(^^;(-)//5^)(t^')^) 



/3* 



Note that the 0(l/y^) terms in (39c) and (39e) cancel each other as promised, so there are 
no ©(I/a/t) terms in the final bound. 

Step 4: One time-step drift under the LQFS-LB rule. We now explain in what 
sense the increment £(r + 1) — 'C(r) under the artificial rule is "almost" an upper bound on 
this increment under LQFS-LB. To illustrate the idea, suppose first that all /3j are equal. 
Then, it is easy to observe that for any fixed S{t\ the increment £(r + 1) — >C(t) under 
the artificial rule is (with probability 1) an upper bound of this increment under LQFS- 
LB. Indeed, suppose first that a transition of the Markov chain is associated with a service 
completion in server pool j with ZJ = 0. (If < 0, there is no difference in what the two 
rules do.) The only case of interest is when the LQFS-LB "takes" a new customer for service 
from queue i with Q\ > 0, while the artificial rule tries to take a customer from a different 
queue i'. Then > Q^, must hold, with > Q\i being the non-trivial case. If Q\, > 0, 
then the LQFS-LB will decrease the larger queue, and so the increment £(r + 1) — £(r) 
under the LQFS-LB is smaller (which is true for both positive and negative 9). If Q^, = 0, 
then the LQFS-LB will still decrease queue Q^, while the artificial rule will instead decrease 
Zj] using convexity of e^^, we verify that, again, the increment £(r + 1) — £(r) under the 
LQFS-LB is smaller (for both positive and negative 6). If transition of the Markov chain 
is associated with a new customer arrival, we use analogous argument to show that, again, 
the increment £(r -|- 1) — C{t) under the LQFS-LB cannot be greater than that under the 
artificial rule. We conclude that when all (3j are equal, the key estimate (39) of the espected 
drift holds, in exactly same form, for LQFS-LB rule as well. 

Now consider the case of general Pj. In the event of a service completion (and then possibly 
taking a customer for service from one of the non-zero queues), the increment C{t + 1) — C{t) 
under LQFS-LB is still no greater than under the artificial rule. (Verified similarly to the 
case of all f3j being equal.) The only situation when LQFS-LB can possibly cause a greater 
increment than the artificial rule is as follows. There is an arrival of a type i customer, 
which the artificial rule routes to pool j with Zj < 0, but the LQFS-LB will instead route it 
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to pool k such that Zj/j3j > Zl/j3k- Given convexity of function e , the "worst case", i.e. 
the largest increment of £(r + 1) — £(r), occurs when ZJl is such that the equality holds, 
^j/t^j ~ ^kl Pk- (If 6* > the positive increment gets larger, if we were to increase Z^; 
if < the negative increment gets smaller in absolute value, if we were to increase Z^'. 
Note also that here we allow Z^, determined by the equality, to be such that Zl = Zly/r 
is possibly non-integer, because we only use this value of Zl to estimate the increment of 
a function.) Thus, as we replace the artificial rule by LQFS-LB, in the "worst case", the 
increment 

/3, exp(e[Z;(r) + r-i/2]//3,.) - exp(^Z;(r)//3,.) 
may need to be replaced by 

Pkexpie[ZliT) + T-^l^\lh) - /3fcexp(^^Z,^(r)//3,), 
with Z].{j) satisfying Z'j{T)/ (3j = ZI{t)/ (3k- In this case we obtain 

(40) h eMOZl{r + l)/^) - h exp(^^4:(r)//3,) < 

1 1 
Jk2 



exp{dZl{T)/Pk){er-''^+[^~e'eMO/V^)) ^) < 

exp(^Z;(r)//3,)(0r-V2+ (^i_i^2^xp(e/v/f)) 1^^ 



This means that, under LQFS-LB rule, the estimate (39) still holds 



Step 5: Exponential moments estimates. Next, note that for each fixed i^' > and 
each fixed parameter r, the values of exp(6'(5[(T)) are uniformly bounded over all states S{t) 
satisfying condition C{t) < K; the values of exp(6'ZJ(r)//3j) are "automatically" uniformly 
bounded (for a fixed r). We take the expected values of both parts of (39) with respect to 
the invariant distribution. The expectation of the LHS is of course 0, and so we get rid of 
the factor 1/r from the RHS expectation. The resulting estimates we will write separately 
for the cases 6 > and < (with the case ^ = being trivial). 

Case 6 > 0. For a fixed 6 > 0, the expected value of the sum of all terms not containing 
exp(^Q[(r)) is bounded (uniformly in r). Indeed, this follows from the facts that (t) < 

and < -0Z;(r)exp(^ZJ(r)//3j) < (3j/e (because > xe^ > for x < 0). Then, we 
obtain: 



(41) 



E 



-{Cir)<K} exp(^Q[(r)) 

iex+ 



CA,(l + o(l)) 



e 



for some constant C\ = Ci{6) > 0, uniformly on all sufficiently large r. Now let us fix a 
sufficiently small positive 6, so that all coefficients of exp(6'Q[(r)) are at least some e > 
(for all large r). Recalling that C2 > 1 can be arbitrarily close to 1, it suffices that 6 < Oq = 
2(mini Ai)/a*. Then, 



E 



L{£(r)<i^} 



^exp 



{r)) 
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from where, letting K — )■ oo, by monotone convergence, we obtain 
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(42) 



E 



< Ci/e < oo. 



uniformly on all large r. 

Case ^ < 0. Fix arbitrary ^ < 0. In this case, the expected value of the sum of all terms 
not containing exp(^Zj'(r)), is bounded (uniformly on r). We can write: 



(43) E 



Mcir)<K} E exp(eZJ(r)//3,) ( e[^][Z^^{T) + P,C] - 



1 c.. 



<c[, 



for some constant C[ = C[{6) > 0, uniformly on all sufficiently large r. Let us choose 
sufficiently large Ki > 0, such that the condition -^J(t) < —Ki implies that 



1 C- 



2/)2 



for some e > (and all large r). Then, from (43) 



E 



Hc{t)<k} 



E ^{zUr)<-K^} exp(^^;(^)//3j) 



jeJ- 



< C[/e, 



from where, letting K — )■ oo, by monotone convergence, we obtain 



E 



exp(^Z[(r)//3,) 



< C[/e < oo. 



uniformly on all large r, which implies the required result. □ 
Corollary 7.5. The sequence of stationary distributions of the processes (^{Ql {■)) , {Zj {■)) 

has a weak limit, which is the unique stationary distribution of the limiting process (^{Qi{-)), {Zj{-)) j , 
described as follows: 

4(t)=max{f(t)//,0}, V^, Z,it)=mm{^^Yit),0}, Vj, 

where Y{-) is a one- dimensional diffusion process with constant variance parameter 2 Aj 
and piece-wise linear drift, equal at point x to 



The invariant distribution density is, then, a continuous function, which is a "concatenation" 
at point of exponential (for x > 0) and Gaussian (for x <0) distribution densities. 



Proof. Theorem 
Then, it follows 



7.4 



of course implies tightness of stationary distributions of [{Qi{-)), {Zj{-))j . 

Trom |9l Theorem 8.5.1] (whose conditions are easily verified in our case), 
that as r — )• oo, any weak limit of the sequence of stationary distributions of the processes 
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[{Qi{-)),{Zj{-))j is a stationary distribution of the limit process, described in [71 Theo- 
rem 4.4], and therefore is the one- dimensional diffusion specified in the statement of the 
corollary. □ 



Finally, we remark that a tightness result analogous to Theorem 7.4 holds for the under- 
loaded system, p < 1, and can be proved essentially the same way. 

The asymptotic regime in this case is such that A[ = rAj (there is no point in considering 
0{^/r) terms in A[ when p < 1). We denote Zj{t) = ^j(i) — rPjp (which is consistent with 
the definition given earlier in this section for p = 1), and keep notation Qi{t) for the queue 
length. We work with the following Lyapunov function: 



The same approach as in the proof of Theorem 7.4 leads to the following result: for any real 
0, 

lim sup E [^^^ exp{9Z!j)] < oo. 

r 

The limiting process for (^J(-)) is {Zj{-)) = {^^Y {■)) , with F(-) being a one-dimensional 
Ornstein-Uhlenbeck process, with Gaussian stationary distribution. The limit of stationary 
distributions of {Zj{-)) is the stationary distribution of {Zj{-)). 

Acknowledgment. Authors would like to thank the referees for useful comments that 
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